Firefighting Drone with Reinforcement Learning based Control

Tomasz Lewicki

Presentation Outline:

  • Motivation for the project
  • Environment + code walkthrough (key pieces)
  • Agent + code
  • Training + code & results

Code available: https://github.com/tomek-l/rl-firefighting

Motivation: Wildfire Problem

Wildfires are a growing problem. Record examples from recent years:

  • 2018 Camp Fire: Costliest natural disaster worldwide in 2018; most destructive & deadliest fire in CA history
  • 2019 Australian Bushfires: 46M acres burned (over $6 \times$ the area of Belgium)
  • 2020 Wildfire Season in California: $5M$ acres (largest in California's recorded history)


Motivation pt. 2: Drones as a potential solution

  1. Can detect fires with camera+AI payload
  2. Can access places inaccessible to humans or wheeled robots
  3. Can be dispatched in much more dangerous scenarios than humans


This topic is an extension of my thesis project. In short:

  • Built a drone with an Nvidia GPU for onboard processing
  • Trained a deep-CNN-based infrared + RGB spectrum fire detector
  • The IR + RGB perception system detects fire with $>95 \%$ accuracy
  • Validated during a real 80-acre fire incident

We have an autonomous drone with perception system. In this project I extend that by generating dynamic trajectories with RL, based on the observations made by the perception system.

(none of the components here overlap with my thesis)


Environment Model

  • cellular simulation of a wildfire
  • on an $N \times N$ grid lattice
  • each cell of the lattice has:
    • state $ s \in \{0,1\}$ (i.e. intact or burning)
    • fuel $ f \in [0,255]$
  • in each step, every burning cell:
    • loses burn_rate amount of fuel
    • with probability $p \in [0,1]$ sets neighboring cells on fire
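A minimal vectorized sketch of this update rule (illustrative only; the names `simulation_step`, `burn_rate`, and `ignition_prob` follow the text above, but the repo's implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulation_step(state, fuel, burn_rate=3, ignition_prob=0.2):
    # state: (N, N) array, 1 = burning, 0 = intact
    # fuel:  (N, N) array with values in [0, 255]
    burning = state == 1

    # every burning cell loses burn_rate fuel; out of fuel -> stops burning
    fuel = np.where(burning, np.maximum(fuel - burn_rate, 0), fuel)
    state = np.where(burning & (fuel == 0), 0, state)

    # burning cells ignite their 4-neighbors with probability ignition_prob
    has_burning_neighbor = np.zeros_like(burning)
    has_burning_neighbor[1:, :] |= burning[:-1, :]   # from the cell above
    has_burning_neighbor[:-1, :] |= burning[1:, :]   # from the cell below
    has_burning_neighbor[:, 1:] |= burning[:, :-1]   # from the left
    has_burning_neighbor[:, :-1] |= burning[:, 1:]   # from the right
    ignite = has_burning_neighbor & (state == 0) & (fuel > 0)
    ignite &= rng.random(state.shape) < ignition_prob
    state = np.where(ignite, 1, state)

    return state, fuel
```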



Cell simulation

There's much more to it; if you're interested, ask during the Q&A.

More info:

Julian and Kochenderfer, Distributed Wildfire Surveillance with Autonomous Aircraft Using Deep Reinforcement Learning. https://arxiv.org/abs/1810.04244

Haksar and Schwager, Distributed Deep Reinforcement Learning for Fighting Forest Fires with a Network of Aerial Robots. https://msl.stanford.edu/sites/g/files/sbiybj8446/f/root_0.pdf

Environment API

In [1]:
from environment import Environment

# Initialize environment with 10k cells
env = Environment(shape=(100, 100), tree_density=0.55, random_seed=260)
In [2]:
# Map of fuel ("forest"): 45% of cells are empty
# and 55% have fuel drawn uniformly from the [0,255] range
env.fuel()
Out[2]:
In [3]:
# With .fire() method we can see the rendered map of fire
env.fire()
Out[3]:
In [4]:
# Finally, the .snapshot() method shows a combined view
env.snapshot()
Out[4]:

Let's run a few steps

In [5]:
for _ in range(10):
    env.simulation_step()
    
env.snapshot()
Out[5]:

Let's run some more steps

In [6]:
for _ in range(100):
    env.simulation_step()
env.snapshot() # The wildfire simulation is unpredictable & highly sensitive to initial conditions (just like reality)
Out[6]:

Performance

In [7]:
# 10 000 cells @ ~2000 FPS
%timeit env.simulation_step()
550 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]:
# 1M cells @ ~20FPS
env = Environment(shape=(1000, 1000), tree_density=0.55, random_seed=260)

%timeit -n 100 env.simulation_step()
50.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]:
# snapshot with 1M cells
env.snapshot()
Out[9]:
In [10]:
# 16M cells ~1FPS
env = Environment(shape=(4000, 4000), tree_density=0.55, random_seed=None)

%time env.simulation_step()

# Beyond that, probably makes sense to move the simulation to GPU
CPU times: user 863 ms, sys: 48.1 ms, total: 911 ms
Wall time: 909 ms

API for RL training

  • Implemented from scratch by me
  • sticking to OpenAI gym API & conventions
In [11]:
# Standard methods for RL training

def step(self, drone_pos, action, sim_step_every=1):
    """
    environment step
    returns observation and reward
    """
    if self._step_cnt % sim_step_every == 0:
        self.simulation_step(burn_rate=3, ignition_prob=0.2)
    self._step_cnt += 1

    observation = self.observe(drone_pos)
    reward = self.extinguish(drone_pos)

    return observation, reward

def reset(self):
    """
    restores the initial state of the environment
    """
    self._state_map = np.zeros_like(self._state_map)
    self.ignite_center()
    

@property
def done(self):
    """
    Terminal state of the environment.
    Returns True if the fire
    a) was put out
    b) died out by itself
    """
    return self._state_map.sum() == 0
In [12]:
def observe(self, drone_pos, fov=[5, 5]):
    """
    Get observations within fov from drone_pos
    """
    x, y = drone_pos
    fov_x, fov_y = int((fov[0] - 1) / 2), int((fov[1] - 1) / 2)
    view = self._state_map[y - fov_y : y + fov_y + 1, x - fov_x : x + fov_x + 1]

    return view
In [13]:
def extinguish(self, drone_pos, fov=[3, 3]):
    """
    extinguish cells within fov below drone_pos
    """
    x, y = drone_pos
    fov_x, fov_y = int((fov[0] - 1) / 2), int((fov[1] - 1) / 2)

    slice_y = slice(y - fov_y, y + fov_y + 1)
    slice_x = slice(x - fov_x, x + fov_x + 1)
    fire_count = self._state_map[slice_y, slice_x].sum()

    # set state to 0 (extinguish)
    self._state_map[slice_y, slice_x] = 0

    return fire_count

For more implementation details:

https://github.com/tomek-l/rl-firefighting

Not limited to wildfires!

  • imagine cells can be shuffled (travel)
  • we have a classic SIR* epidemic model
  • fire is the virus
  • cells are the people
  • we model spread of the virus

*(SIR = Susceptible-Infectious-Removed)
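The analogy can be sketched with a hypothetical "travel" step that randomly permutes cells (state and fuel together), so the fire spreads through a mixing population rather than a fixed lattice (illustrative only; no such step exists in the repo):

```python
import numpy as np

rng = np.random.default_rng(1)

def travel_step(state, fuel):
    # Shuffle cells ("people") to model travel; the same permutation is
    # applied to state and fuel so each cell keeps its own attributes.
    perm = rng.permutation(state.size)
    shuffled_state = state.ravel()[perm].reshape(state.shape)
    shuffled_fuel = fuel.ravel()[perm].reshape(fuel.shape)
    return shuffled_state, shuffled_fuel
```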

Agent

  • Agent is a drone that 'lives' in the lattice environment
  • At each iteration the agent can take one of four actions: $ a \in \{N, S, W, E \} $
  • Actions are sampled using $\epsilon$ greedy approach:
    • Agent takes a random action with probability $\epsilon$
    • Agent takes an action based on policy network with prob. $1 - \epsilon$
  • Agent gets reward proportional to the amount of fuel (trees) saved:
    • reward is normalized such that $r \in (-1, 1)$
    • for not extinguishing any fire, agent gets $r=-1$
    • for extinguishing segments with the max. amount of fuel, agent gets $r=1$
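One way to realize such a normalization (a hypothetical linear mapping, not necessarily the exact formula in the repo): with `max_fuel` being the most fuel extinguishable in one step, map zero extinguished fuel to $r=-1$ and the maximum to $r=1$:

```python
def normalize_reward(extinguished_fuel, max_fuel):
    # Linear map from [0, max_fuel] to [-1, 1]:
    # nothing extinguished -> -1, maximum possible -> +1
    return 2.0 * extinguished_fuel / max_fuel - 1.0
```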
In [14]:
class Agent:
    def __init__(
        self,
        init_position=[50, 50],
        env_shape=(100, 100), 
        fov_shape=(7,7), # perception fov
        fov_extinguish=(5,5), # extinguish range
        lr=1e-6
    ):

        self._env_shape = env_shape
        self.position = np.array(init_position)
        self._x_bounds = range(fov_shape[1], env_shape[1]-fov_shape[1])
        self._y_bounds = range(fov_shape[0], env_shape[0]-fov_shape[0])
        self._actions, self._traj = self.generate_trajectory()
        self._fov_shape = fov_shape
        self._fov_extinguish = fov_extinguish
        self._view = np.zeros(fov_shape)
        
        self._policy = torch.nn.Sequential(
            torch.nn.Linear(fov_shape[0]*fov_shape[1], 2048), # input: (the 7x7 segment view)
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(2048, 2048),
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(2048, 4), # output: (the values of N, S, W, E actions)
            torch.nn.Sigmoid() # normalize 
        )
        
        self._optimizer = torch.optim.Adam(self._policy.parameters(), lr)
In [15]:
def action(self, epsilon):
    """
    sample and return action using epsilon-greedy approach
    epsilon is the fraction of random actions
    """
    # sample randomly
    Q_rand = self.random_policy()

    # forward pass on the policy neural net 
    view = self._view.flatten() # e.g. 5x5 view of environment state
    self._tensor_in = torch.tensor(view, dtype=torch.float32)
    self._tensor_out = self._policy(self._tensor_in)
    Q_policy = self._tensor_out.detach().numpy() # Q values approximated by the policy network

    # epsilon-greedy, biased coin toss
    Q = Q_rand if np.random.rand() < epsilon else Q_policy

    # assign -inf reward to actions that take us to infeasible states
    viable = self.viable_actions()
    Q[~viable] = - np.inf
    action = np.argmax(Q)

    # update position based on chosen action
    delta = action2delta[action]
    delta = list(reversed(delta)) # (x,y) -> (y,x)
    self.position += delta

    return action
In [16]:
action2delta = {
    0: (0, 1),  # x,y format
    1: (-1, 0),
    2: (0, -1),
    3: (1, 0),
}

For more implementation details:

https://github.com/tomek-l/rl-firefighting
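The infeasible-action masking inside `action()` can be illustrated in isolation (example values only):

```python
import numpy as np

Q = np.array([0.2, 0.9, 0.4, 0.1])            # example action values
viable = np.array([True, False, True, True])  # e.g. action 1 would leave the grid

Q[~viable] = -np.inf          # a masked action can never be the argmax
action = int(np.argmax(Q))    # best *viable* action: index 2
```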

Training loop

In [ ]:
from environment import Environment
from agent import Agent

EPISODES = 1000 # roughly ~12h overnight
MAX_ITER = 10000 # terminal state will likely be reached after 50-2000 iterations
EPSILON_SCHEDULE = np.linspace(1, 0.05, num=EPISODES) # decreasing epsilon 
SAVE_EVERY = 1 # save renders every # of episodes
LR = 1e-6

drone = Agent(init_position=[10, 10], lr=LR, fov_shape=(9,9), fov_extinguish=(6,6))
env = Environment(random_seed=None)

for episode, eps in zip(range(EPISODES), EPSILON_SCHEDULE):
    
    print(f"{episode}/{EPISODES}")
    env.reset()
    drone.position = np.random.randint(10,90,2) # respawn drone at a random position
    frame_buffer = []
    
    for it in range(MAX_ITER):
        position = drone.position # drone's (x,y) pos in the env
        action = drone.action(epsilon=eps) # get action

        # environment step (slightly different from gym's API)
        env.step(position, action)
        drone.observation = env.observe(position, fov=drone._fov_shape) # make observation of env
        reward = env.extinguish(position, fov=drone._fov_extinguish) # reward, normalized in range (-1, 1)
        drone.backprop(reward) # backpropagate through policy network

        # render environment using snapshot() method and add to frame buffer
        frame_buffer.append(env.snapshot(drone.position))
         
        # in a terminal state -> next episode.  
        # (This means all fire extinguished, or died out)
        if env.done: break
            
    if episode % SAVE_EVERY == 0: drone.save_video(f"episode-{episode}.mp4", frame_buffer)

Result evaluation

Each rollout is executed twice with the same settings:

  • once with the firefighting drone
  • once without (just let it burn & die out)
  • count the remaining fuel in each scenario
  • the difference between the two, normalized by the total amount of fuel, is our efficiency: the fraction of fire contained by the drone
  • used as the metric of success

Efficiency

  • $ efficiency = \frac{\text{fuel remaining (drone)} - \text{fuel remaining (no drone)}}{\text{total fuel}} = \frac{\text{saved fuel}}{\text{total fuel}}$

  • $ efficiency = 1 $ means we save everything (impossible)

  • $ efficiency = 0 $ means we've done nothing
  • An optimal policy would yield the highest obtainable efficiency, close to $1$
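As a sanity check, the metric is a one-liner (hypothetical helper mirroring the formula above):

```python
def efficiency(fuel_left_drone, fuel_left_no_drone, total_fuel):
    # Fraction of the total fuel saved thanks to the drone:
    # 1.0 = everything saved, 0.0 = no better than letting it burn out
    return (fuel_left_drone - fuel_left_no_drone) / total_fuel
```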

After 1000 episodes, average efficiency on a 100 test rollouts in 100 randomly generated environments with:

  • env_shape=(100,100)
  • tree_density=0.55
  • burn_rate=3

Random policy: 0.040

Heuristic 1 (lawn-mower trajectory): 0.114

Heuristic 2 (spiral trajectory): 0.133

NN policy: 0.381

Future work

Left in the backlog: integrating with a physical drone. This means implementing alternative versions of:

  • drone.observe():
      - instead of observing the simulated environment
      - use the actual IR & RGB cameras
      - and the fire detection CNN from my thesis
  • drone.action():
      - instead of `self.position += delta` (virtually changing the drone's state)
      - actually send commands to the drone over the MAVLink protocol

Q & A